A Proactive Fault Tolerance Scheme for Large Scale Storage Systems

Authors

Xinpu Ji

Yuxiang Ma

Rui Ma

Peng Li

Jingwei Ma

Gang Wang

Xiaoguang Liu

Zhongwei Li

Abstract

Facing the increasingly high failure rate of drives in data centers, reactive fault tolerance mechanisms alone can hardly guarantee high reliability. Hard drive failure prediction models that identify soon-to-fail drives in advance have therefore been proposed, but few researchers have applied these models to distributed systems to improve reliability. This paper proposes SSM (Self-Scheduling Migration), which monitors drives' health status and uses the results of the prediction models to migrate data from soon-to-fail drives to other drives in advance. We integrate a self-scheduling migration algorithm into distributed systems to transfer the data from soon-to-fail drives. The algorithm dynamically adjusts the migration rates according to each drive's severity level, which is derived from the real-time prediction results. Moreover, the algorithm makes full use of available resources and balances the load when selecting migration source and destination drives. Migration bandwidth is allocated so as to minimize the side effects of migration on system services. We implement a prototype based on the Sheepdog distributed system. Migration causes performance drops of only 8% on read operations and 13% on write operations. Compared with reactive fault tolerance, SSM significantly improves system reliability and availability.
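The abstract does not give the details of the scheduling algorithm, so the following is only a minimal sketch of the general idea: migration bandwidth is split across at-risk drives in proportion to severity, and the least-loaded healthy drive is chosen as a destination. The `Drive` fields, the integer severity scale, the proportional split, and both function names are assumptions made for illustration and are not taken from the paper or from Sheepdog.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    name: str
    severity: int  # assumed scale: 0 = healthy, larger = closer to predicted failure
    load: float    # assumed current I/O load, from 0.0 (idle) to 1.0 (saturated)


def allocate_migration_bandwidth(drives, total_bandwidth):
    """Split a migration bandwidth budget across at-risk drives by severity.

    Proportional sharing is an assumption; the paper's actual policy may differ.
    """
    at_risk = [d for d in drives if d.severity > 0]
    total_severity = sum(d.severity for d in at_risk)
    if total_severity == 0:
        return {}  # nothing is predicted to fail, so nothing migrates
    return {
        d.name: total_bandwidth * d.severity / total_severity
        for d in at_risk
    }


def pick_destination(drives):
    """Return the name of the least-loaded healthy drive, or None if there is none."""
    healthy = [d for d in drives if d.severity == 0]
    if not healthy:
        return None
    return min(healthy, key=lambda d: d.load).name


drives = [
    Drive("sda", severity=3, load=0.5),
    Drive("sdb", severity=1, load=0.2),
    Drive("sdc", severity=0, load=0.7),
    Drive("sdd", severity=0, load=0.1),
]
rates = allocate_migration_bandwidth(drives, total_bandwidth=100.0)
target = pick_destination(drives)
```

With this hypothetical input, the drive with the higher severity receives the larger share of the budget, and the idle healthy drive is picked as the target. A real scheduler would also have to cap the budget against foreground I/O, which this sketch leaves out.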

Similar resources

Design of Fault-Tolerant Large-Scale VOD Servers: With Emphasis on High-Performance and Low-Cost

Recent technological advances in digital signal processing, data compression techniques, and high-speed communication networks have made Video-on-Demand (VOD) servers feasible. A challenging task in such systems is servicing multiple clients simultaneously while satisfying real-time requirements of continuous delivery of objects at specified rates. To accomplish these tasks and realize economi...

Proactive Service Migration for Long-Running Byzantine Fault Tolerant Systems

In this paper, we describe a novel proactive recovery scheme based on service migration for long-running Byzantine fault tolerant systems. Proactive recovery is an essential method for ensuring long term reliability of fault tolerant systems that are under continuous threats from malicious adversaries. The primary benefit of our proactive recovery scheme is a reduced vulnerability window. This ...

Failure prediction for HPC systems and applications: Current situation and open issues

As large-scale systems evolve towards post-petascale computing, it is crucial to focus on providing fault-tolerance strategies that aim to minimize fault’s effects on applications. By far the most popular technique is the checkpoint–restart strategy. A complement to this classical approach is failure avoidance, by which the occurrence of a fault is predicted and proactive measures are taken. Th...

A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud

As high-performance computing (HPC) systems continue to increase in scale, their mean-time to interrupt decreases respectively. The current state of practice for fault tolerance (FT) is checkpoint/restart. However, with increasing error rates, increasing aggregate memory and not proportionally increasing I/O capabilities, it is becoming less efficient. Proactive FT avoids experiencing failures ...

Towards a Secure Fragment Allocation of Files in Heterogeneous Distributed Systems

There is a growing demand for large-scale distributed storage systems to support resource sharing and fault tolerance. Although heterogeneity issues of distributed systems have been widely investigated, little attention has been given to security solutions designed for distributed storage systems with heterogeneous vulnerabilities. To address this issue, we design a Secure Fragment Allocation S...

Publication date: 2015
